Abstract

This paper explores the annotation and classification of students’
revision behaviors in argumentative writing. A sentence-level revision
schema is proposed to capture why and how students make revisions. Based
on the proposed schema, a small corpus of student essays and revisions
was annotated. Studies show that manual annotation is reliable with the
schema and the annotated information helpful for revision analysis.
Furthermore, features and methods are explored for the automatic
classification of revisions. Intrinsic evaluations demonstrate promising
performance in high-level revision classification (surface vs.
text-based). Extrinsic evaluations demonstrate that our method for
automatic revision classification can be used to predict a writer’s
improvement.

1 Introduction 

Rewriting is considered an important factor in successful writing.
Research shows that expert writers revise in ways different from
inexperienced writers (Faigley and Witte, 1981). Recognizing the
importance of rewriting, more and more efforts are being made to
understand and utilize revisions. There are rewriting suggestions made
by instructors (Wells et al., 2013), studies modeling revisions for
error correction (Xue and Hwa, 2010; Mizumoto et al., 2011) and tools
aiming to help students with rewriting (Elireview, 2014; Lightside,
2014).

While there is increasing interest in the improvement of writers’
rewriting skills, there is still a lack of study on the details of
revisions. First, to find out what has been changed (defined as
revision extraction in this paper), a typical approach is to extract
and analyze revisions at the word/phrase level based on edits extracted
with character-level text comparison (Bronner and Monz, 2012;
Daxenberger and Gurevych, 2012). The semantic information of sentences
is not considered in the character-level text comparison, which can
lead to errors and loss of information in revision extraction. Second,
the differentiation of different types of revisions (defined as
revision categorization) is typically not fine-grained. A common
categorization is a binary classification of revisions according to
whether the information of the essay is changed or not (e.g. text-based
vs. surface as defined by Faigley and Witte (1981)). This
categorization ignores potentially important differences between
revisions under the same high-level category. For example, changing the
evidence of a claim and changing the reasoning of a claim are both
considered text-based changes. Usually changing the evidence makes a
paper more grounded, while changing the reasoning helps with the
paper’s readability. This could indicate different levels of
improvement to the original paper. Finally, for the automatic
differentiation of revisions (defined as revision classification),
while there are works on the classification of Wikipedia revisions
(Adler et al., 2011; Bronner and Monz, 2012; Daxenberger and Gurevych,
2013), there is a lack of work on revision classification in other
datasets such as student writings. It is not clear whether current
features and methods can still be adapted or new features and methods
are required.

To address the issues above, this paper makes 
Draft 1

  1. In the circle I would place Bill Clinton because he had an affair
     with his aide.

Draft 2

  1. In the third circle of Hell, sinners have uncontrollable lust.
  2. The carnal sinners in this level are punished by a howling, endless
     wind.
  3. Bill Clinton would be in this level because he had an affair with
     his aide.

Character-level

  Step 1. Edit segmentation according to the result of a text
  difference algorithm:
    (equal, "In the"), (insert, "third"), (equal, "circle"),
    (insert, "of Hell, sinners.... wind"), (equal, "Bill Clinton"),
    (insert, "would be in this level"), (equal, "because he ...aide")

  Step 2. Merge continuous edit segments into revision units:
    (Align: 1->1,2,3, Type: Factual)

Sentence-level

  Step 1. Align sentences:
    (1->3), (null->1), (null->2)

  Step 2. Extract revisions from aligned sentences:
    (Align: 1->3, Op: Modify, Purpose: Word Usage/Clarity),
    (Align: 1->3, Op: Modify, Purpose: Organization),
    (Align: null->1, Op: Add, Purpose: Warrant/Reasoning/Backing),
    (Align: null->2, Op: Add, Purpose: General Content)


Figure 1: In the example, words in sentence 1 of Draft 1 are rephrased and reordered to sentence 3 of Draft 
2. Sentences 1 and 2 in Draft 2 are newly added. Our method first marks 1 and 3 as aligned and the other two 
sentences of Draft 2 as newly added based on semantic similarity of sentences. The purposes and operations 
are then marked on the aligned pairs. In contrast, previous work extracts differences between drafts at the 
character level to get edit segments. The revision is extracted as a set of sentences covering the contiguous 
edit segments. Sentence 1 in Draft 1 is wrongly marked as being modified to 1, 2, 3 in Draft 2 because 
character-level text comparison could not identify the semantic similarity between sentences. 


the following efforts. First, we propose that it is better to extract
revisions at a level higher than the character level, and in
particular, explore the sentence level. This avoids the misalignment
errors of character-level text comparisons. Finer-grained studies can
still be done on the sentence-level revisions extracted, such as
fluency prediction (Chae and Nenkova, 2009), error correction (Cahill
et al., 2013; Xue and Hwa, 2014), statement strength identification
(Tan and Lee, 2014), etc. Second, we propose a sentence-level revision
schema for argumentative writing, a common form of writing in
education. In the schema, categories are defined for describing an
author’s revision operations and revision purposes. The revision
operations can be directly decided according to the results of sentence
alignment, while revision purposes can be reliably manually annotated.
We also do a corpus study to demonstrate the utility of sentence-level
revisions for revision analysis. Finally, we adapt features from
Wikipedia revision classification work and explore new features for our
classification task, which differs from prior work with respect to both
the revision classes to be predicted and the sentence-level revision
extraction method. Our models are able to distinguish whether the
revisions are changing the content or not. For fine-grained
classification, our models also demonstrate good performance for some
categories. Beyond the classification task, we also investigate the
pipelining of revision extraction and classification. Results of an
extrinsic evaluation show that the automatically extracted and
classified revisions can be used for writing improvement prediction.

2 Related work 

Revision extraction. To extract the revisions for revision analysis, a
widely chosen strategy uses character-based text comparison algorithms
first and then builds revision units on the differences extracted
(Bronner and Monz, 2012; Daxenberger and Gurevych, 2013). While
theoretically revisions extracted with this method can be more precise
than sentence-level extractions, it could suffer from the misalignments
of revised content due to character-level text comparison algorithms.
For example, when a sentence is rephrased, a character-level text
comparison algorithm is likely to make alignment errors as it cannot
recognize semantic similarity. As educational research has suggested
that revision analysis can be done at the sentence level (Faigley and
Witte, 1981), we propose to extract revisions at the sentence level
based on semantic sentence alignment instead. Figure 1 provides an
example comparing revisions annotated in our work to revisions
extracted in prior work (Bronner and Monz, 2012). Our work identifies
the fact that the student added new information to the essay and
modified the organization of old sentences. The previous work, however,
extracts all the modifications as one unit and cannot distinguish the
different kinds of revisions inside the unit. Our method is similar to
Lee and Webster’s method (Lee and Webster, 2012), where a
sentence-level revision corpus is built from college students’ ESL
writings. However, their corpus only includes the comments of the
teachers and does not have every revision annotated.

Revision categorization. In an early educational work from Faigley and
Witte (1981), revisions are categorized into text-based change and
surface change based on whether they changed the information of the
essay or not. A similar categorization (factual vs. fluency) was chosen
by Bronner and Monz (2012) for classifying Wikipedia edits. However,
many differences could not be captured with such coarse-grained
categorizations. In other works on Wikipedia revisions, finer
categorizations of revisions were thus proposed: vandalism, paraphrase,
markup, spelling/grammar, reference, information, template, file etc.
(Pfeil et al., 2006; Jones, 2008; Liu and Ram, 2009; Daxenberger and
Gurevych, 2012). Corpus studies were conducted to analyze the
relationship between revisions and the quality of Wikipedia articles
based on the categorizations. Unfortunately, their categories are
customized for Wikipedia revisions and could not easily be applied to
educational revisions such as ours. In our work, we provide a
fine-grained revision categorization designed for argumentative
writing, a common form of writing in education, and conduct a corpus
study to analyze the relationship between our revision categories and
paper improvement.

Revision classification. Features and methods are widely explored for
Wikipedia revision classifications (Adler et al., 2011; Mola-Velasco,
2011; Bronner and Monz, 2012; Daxenberger and Gurevych, 2013; Ferschke
et al., 2013). Classification tasks include binary classification for
coarse categories (e.g. factual vs. fluency) and multi-class
classification for fine-grained categories (e.g. 21 categories defined
by Daxenberger and Gurevych (2013)). Results show that the binary
classifications on Wikipedia data achieve a promising result.
Classification of finer-grained categories is more difficult and the
difficulty varies across different categories. In this paper we explore
whether the features used in Wikipedia revision classification can be
adapted to the classification of different categories of revisions in
our work. We also utilize features from research on argument mining and
discourse parsing (Burstein et al., 2003; Burstein and Marcu, 2003;
Sporleder and Lascarides, 2008; Falakmasir et al., 2014; Braud and
Denis, 2014) and evaluate revision classification both intrinsically
and extrinsically. Finally, we explore end-to-end revision processing
by combining automatic revision extraction and categorization via
automatic classification in a pipelined manner.

3 Sentence-level revision extraction and 
categorization 

This section describes our work for sentence-level revision extraction
and revision categorization. A corpus study demonstrates the use of the
sentence-level revision annotations for revision analysis.

3.1 Revision extraction 

As stated in the previous section, our method takes semantic
information into consideration when extracting revisions and uses the
sentence as the basic semantic unit; besides the utility of sentence
revisions for educational analysis (Faigley and Witte, 1981; Lee and
Webster, 2012), automatic sentence segmentation is quite accurate.
Essays are split into sentences first, then sentences across the essays
are aligned based on semantic similarity. 1 An added sentence or a
deleted sentence is treated as aligned to null as in Figure 1. The
aligned pairs where the sentences in the pair are not identical are
extracted as revisions. For the automatic alignment of sentences, we
used the algorithm in our prior work (Zhang and Litman, 2014) which
considers both sentence similarity (calculated using TF*IDF score) and
the global context of sentences.

1 We plan to also explore revision extraction at the clause level in
the future. Our approach can be adapted to the clause level by
segmenting the clauses first and aligning the segmented clauses after.
A potential benefit is that clauses are often the basic units of
discourse structures, so extracting clause revisions will allow the
direct use of discourse parser outputs (Feng and Hirst, 2014; Lin et
al., 2014). However, potential problems are that clauses contain less
information for alignment decisions and clause segmentation is noisier.
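
To make the similarity-based alignment step concrete, the following is a minimal sketch in Python. It is not the algorithm of Zhang and Litman (2014): it uses only TF*IDF cosine similarity with a greedy one-to-one matching and ignores the global context of sentences. The function names, the whitespace tokenization, the smoothed IDF and the threshold of 0.3 are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(sentences):
    """Build a sparse TF*IDF vector (a dict) for each sentence."""
    tokenized = [s.lower().split() for s in sentences]
    n_sentences = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        # Smoothed IDF (an assumption): frequent words keep a small weight.
        vectors.append({
            tok: count * (math.log((1 + n_sentences) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(draft1, draft2, threshold=0.3):
    """Greedy alignment. Returns 1-based (old, new) pairs; None means null."""
    vectors = tfidf_vectors(draft1 + draft2)
    old_vecs, new_vecs = vectors[:len(draft1)], vectors[len(draft1):]
    candidates = sorted(
        ((cosine(old, new), i, j)
         for i, old in enumerate(old_vecs)
         for j, new in enumerate(new_vecs)),
        reverse=True,
    )
    used_old, used_new, pairs = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # remaining candidates are even less similar
        if i not in used_old and j not in used_new:
            used_old.add(i)
            used_new.add(j)
            pairs.append((i + 1, j + 1))
    # Unmatched old sentences are deletions, unmatched new ones additions.
    pairs += [(i + 1, None) for i in range(len(draft1)) if i not in used_old]
    pairs += [(None, j + 1) for j in range(len(draft2)) if j not in used_new]
    return pairs
```

Applied to the sentences of Figure 1, this sketch aligns sentence 1 of Draft 1 to sentence 3 of Draft 2 and leaves the first two sentences of Draft 2 aligned to null. A threshold tuned on annotated data, or the context-aware algorithm cited above, would be needed for real essays.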

3.2 Revision schema definition 

As shown in Figure 2, two dimensions are considered in the definition
of the revision schema: the author’s behavior (revision operation) and
the reason for the author’s behavior (revision purpose).

Revision operations include three categories: Add, Delete, Modify. The
operations are decided automatically after sentences get aligned. For
example, in Figure 1 where sentence 3 in Draft 2 is aligned to sentence
1 in Draft 1, the revision operation is decided as Modify. The other
two sentences are aligned to null, so the revision operations of these
alignments are both decided as Add.
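
The rule for deciding operations from alignments is simple enough to state as code. The sketch below is illustrative only; the function name and the use of `None` for a null alignment are assumptions, and it presumes that identical aligned pairs have already been filtered out, since those are not revisions.

```python
def revision_operation(old_index, new_index):
    """Map an aligned pair to Add, Delete or Modify (None stands for null)."""
    if old_index is None and new_index is None:
        raise ValueError("an aligned pair needs at least one sentence")
    if old_index is None:
        return "Add"      # sentence exists only in the new draft
    if new_index is None:
        return "Delete"   # sentence exists only in the old draft
    return "Modify"       # both sides exist and differ


# Alignment of Figure 1: (1->3), (null->1), (null->2)
figure1_alignment = [(1, 3), (None, 1), (None, 2)]
operations = [revision_operation(old, new) for old, new in figure1_alignment]
```

For the Figure 1 alignment this yields one Modify and two Add operations, matching the description above.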

The definitions of revision purposes come from several works in
argumentative writing and discourse analysis. Claims/Ideas,
Warrant/Reasoning/Backing, Rebuttal/Reservation, Evidence come from
Claim, Rebuttal, Warrant, Backing, Grounds in Toulmin’s model
(Kneupper, 1978). General Content comes from Introductory material in
the essay-based discourse categorization of Burstein et al. (2003). The
rest come from the categories within the surface changes of Faigley and
Witte (1981). Examples of all categories are shown in Table 1. These
categories can further be mapped to surface and text-based changes
defined by Faigley and Witte (1981), as shown in Figure 2.

Note that while our categorization comes from the categorization of
argumentative writing elements, a key difference is that our
categorization focuses on revisions. For example, while an evidence
revision must be related to the evidence element of the essay, the
reverse is not necessarily true. The modifications on an evidence
sentence could be just a correction of spelling errors rather than an
evidence revision.

3.3 Data annotation 

Our data consists of the first draft (Draft 1) and second draft (Draft
2) of papers written by high school students taking English writing
courses; papers were revised after receiving and generating peer
feedback. Two assignments (from different teachers) have been annotated
so far. Corpus C1 comes from an AP-level course, contains papers about
Dante’s Inferno and contains drafts from 47 students, with 1262
sentence revisions. A Draft 1 paper contains 38 sentences on average
and a Draft 2 paper contains 53. Examples from this corpus are shown in
Table 1. After data was collected, a score from 0 to 5 was assigned to
each draft by experts (for research prior to our study). The score was
based on the student’s performance including whether the student stated
the ideas clearly, had a clear paper organization, provided good
evidence, chose the correct wording and followed writing conventions.
The class’s average score improved from 3.17 to 3.74 after revision.
Corpus C2 (not AP) contains papers about the poverty issues of the
modern reservation and contains drafts from 38 students with 495
revisions; expert ratings are not available. Papers in C2 are shorter
than C1; a Draft 1 paper contains 19 sentences on average and a Draft 2
paper contains 26.

  Category     Revision purpose              Operations
  Surface      Organization                  M
               Conventions/Grammar/Spelling  M
               Word Usage/Clarity            M
  Text-based   Claims/Ideas                  A D M
               Warrant/Reasoning/Backing     A D M
               Rebuttal/Reservation          A D M
               General Content               A D M
               Evidence                      A D M

Figure 2: For the revision purpose, 8 categories are defined. These
categories can be mapped to surface and text-based changes. Revision
operations include Add, Delete, Modify (A, D, M in the figure). Only
text-based changes have Add and Delete operations.

Two steps were involved in the revision scheme annotation of these
corpora. In the first step, sentences between the two drafts were
aligned based on semantic similarity. The kappa was 0.794 for the
sentence alignment on C1. Two annotators discussed the disagreements,
and one annotator’s work was judged to be better and chosen as the gold
standard after discussion. The sentence alignment on C2 was done by one
annotator after his annotation and discussion of the sentence alignment
on C1. In the second step, revision purposes were annotated on the
aligned sentence pairs. Each aligned sentence pair could have multiple
revision purposes (although this is rare in the annotation of our
current corpus). The full papers were also provided to the annotators
for context information. The kappa score for the revision purpose
annotation is shown in Table 2, which demonstrates that our revision
purposes could be annotated reliably by humans. Again one annotator’s
annotation was chosen as the gold standard after discussion. The
distribution of different revision purposes is shown in Tables 3 and 4.

Codes
  Claims/Ideas: change of the position or claim being argued for
  Conventions/Grammar/Spelling: changes to fix spelling or grammar
    errors, misusage of punctuation or to follow the organizational
    conventions of academic writing
Example
  Draft 1: (1, “Saddam Hussein and Osama Bin Laden come to mind when
    mentioning wrathful people”)
  Draft 2: (1, “Fidel Castro comes to mind when mentioning wrathful
    people”)
Revisions
  (1->1, Modify, “claims/ideas”),
  (1->1, Modify, “conventions/grammar/spelling”)

Codes
  Evidence: change of facts, theorems or citations for supporting
    claims/ideas
  Rebuttal/Reservation: change of development of content that rebut
    current claim/ideas
Example
  Draft 1: (1, “In this circle I would place Fidel.”)
  Draft 2: (1, “In the circle I would place Fidel”), (2, “He was
    annoyed with the existence of the United States and used his army
    to force them out of his country”), (3, “Although Fidel claimed
    that this is for his peoples’ interest, it could not change the
    fact that he is a wrathful person.”)
Revisions
  (null->2, “Add”, “Evidence”),
  (null->3, “Add”, “Rebuttal/Reservation”)

Codes
  Word-usage/Clarity: change of words or phrases for better
    representation of ideas
  Organization: changes to help the author get a better flow of the
    paper
  Warrant/Reasoning/Backing: change of principle or reasoning of the
    claim
  General Content: change of content that do not directly support or
    rebut claims/ideas
Example
  As in Figure 1

Table 1: Examples of different revision purposes. Note that in the
second example the alignment is not extracted as a revision when the
sentences are identical.

3.4 Corpus study 

To demonstrate the utility of our sentence-level revision annotations
for revision analysis, we conducted a corpus study analyzing relations
between the number of each revision type in our schema and student
writing improvement based on the expert paper scores available for C1.
In particular, the number of revisions of different categories is
counted for each student. Pearson correlation between the number of
revisions and the students’ Draft 2 scores is calculated. Given that
the student’s Draft 1 and Draft 2 scores are significantly correlated
(p < 0.001, R = 0.632), we controlled for the effect of Draft 1 score
by regressing it out of the correlation. 2 We expect surface changes to
have smaller impact than text-based changes as Faigley and Witte (1981)
found that advanced writers make more text-based changes compared to
inexperienced writers.
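
As an illustration of what regressing Draft 1 score out of the correlation amounts to, the sketch below computes a first-order partial correlation with the standard formula, which is equivalent to correlating the residuals of both variables after a linear regression on the control variable. This is a sketch of the general technique, not the exact procedure used in the study; the function names and the sample values in the comments are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, control):
    """Correlation between x and y with `control` regressed out of both."""
    r_xy = pearson(x, y)
    r_xc = pearson(x, control)
    r_yc = pearson(y, control)
    return (r_xy - r_xc * r_yc) / math.sqrt((1 - r_xc ** 2) * (1 - r_yc ** 2))


# Hypothetical usage: one entry per student.
# revision_counts = [12, 30, 7, 21]       # number of revisions of one type
# draft2_scores   = [3.0, 4.5, 2.5, 4.0]  # expert score of Draft 2
# draft1_scores   = [2.5, 3.5, 2.5, 3.0]  # expert score of Draft 1 (control)
# r = partial_correlation(revision_counts, draft2_scores, draft1_scores)
```

When the control variable is uncorrelated with both other variables, the partial correlation reduces to the ordinary Pearson correlation, which is a convenient sanity check.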

As shown by the first row in Table 5, the overall number of revisions
is significantly correlated with students’ writing improvement.
However, when we compare revisions using Faigley and Witte’s binary
categorization, only the number of text-based revisions is
significantly correlated. Within the text-based revisions, only
Claims/Ideas, Warrant/Reasoning/Backing and Evidence are significantly
correlated. These findings demonstrate that revisions at different
levels of granularity have different relationships to students’ writing
success, which suggests that our schema is capturing salient
characteristics of writing improvement.

2 Such partial correlations are one common way to measure learning gain
in the tutoring literature, e.g. (Baker et al., 2004).

While correlational, these results also suggest the potential utility
of educational technologies based on fine-grained revision analysis.
For teachers, summaries of the revision purposes in a particular paper
(e.g. “The author added more reasoning sentences to his old claim, and
changed the evidence used to support the claim.”) or across the papers
of multiple students (e.g. “90% of the class made only surface
revisions”) might provide useful information for prioritizing feedback.
Fine-grained revision analysis might also be used to provide student
feedback directly in an intelligent tutoring system.

  Revision Purpose     Kappa (C1)   Kappa (C2)
  Surface
    Organization       1            1
    Conventions        0.74         0.87
    Word-usage         1            1
  Text-based
    Claim              0.76         0.89
    Warrant            0.78         0.85
    Rebuttal           1            1
    General Content    0.76         0.80
    Evidence           1            1

Table 2: Agreement of annotation on each category.

  Rev Purpose        # Add   # Delete   # Modify
  Total              800     96         366
  Surface            0       0          297
    Organization     0       0          35
    Conventions      0       0          84
    Word-usage       0       0          178
  Text-based         800     96         69
    Claim            80      23         8
    Warrant          335     40         14
    Rebuttal         1       0          0
    General          289     23         42
    Evidence         95      10         5

Table 3: Distribution of revisions of corpus C1.

  Rev Purpose        # Add   # Delete   # Modify
  Total              280     53         162
  Surface            0       0          141
    Organization     0       0          1
    Conventions      0       0          29
    Word-usage       0       0          111
  Text-based         280     53         21
    Claim            42      12         4
    Warrant          153     23         10
    Rebuttal         0       0          0
    General          60      13         6
    Evidence         25      5          1

Table 4: Distribution of revisions of corpus C2.

  Revision Purpose              R        P
  # All revisions (N = 1262)    0.516    <0.001
  # Surface revisions           0.137    0.363
    # Organization              0.201    0.180
    # Conventions               -0.041   0.778
    # Word-usage/Clarity        0.135    0.371
  # Text-based revisions        0.546    <0.001
    # Claim/Ideas               0.472    0.001
    # Warrant                   0.462    0.001
    # General                   0.259    0.083
    # Evidence                  0.415    0.004

Table 5: Partial correlation between number of revisions and Draft 2
score on corpus C1 (partial correlation regresses out Draft 1 score);
rebuttal is not evaluated as there is only 1 occurrence.

4 Revision classification

In the previous section we described our revision schema and
demonstrated the utility of it. This section investigates the
feasibility of automatic revision analysis. We first explore
classification assuming we have revisions extracted with perfect
sentence alignment. After that we combine revision extraction and
revision classification in a pipelined manner.

4.1 Features

As shown in Figure 3, besides using unigram features as a baseline, our
features are organized into Location, Textual, and Language groups
following prior work (Adler et al., 2011; Bronner and Monz, 2012;
Daxenberger and Gurevych, 2013).

Draft 1: 5 paragraphs, the third paragraph contains 5 sentences
  In Paragraph 3:
  1. The third circle is for Wrathful people.
  2. Saddam Hussein and Osama Bin Laden come to mind when mentioning
     wrathful people.

Draft 2: 7 paragraphs, the third paragraph contains 9 sentences
  In Paragraph 3:
  1. The third circle contains wrathful people.
  2. Fidel Castro comes to mind when mentioning wrathful people.

Unigram
  Unigrams of all: ("Saddam", "Hussein", "and", "Osama", "Bin",
    "Laden", "come", "to", "mind", "when", "mentioning", "wrathful",
    "people", "Fidel", "Castro", "comes")
  Unigrams of diff: ("Saddam", "Hussein", "and", "Osama", "Bin",
    "Laden", "Fidel", "Castro", "come", "comes")

Location
  First sentence of paragraph?  Old Draft: No   New Draft: No
  Last sentence of paragraph?   Old Draft: No   New Draft: No
  First paragraph of the essay? Old Draft: No   New Draft: No
  Last paragraph of the essay?  Old Draft: No   New Draft: No
  Sentence in the paragraph (Ratio):
    Old Draft: (2-1)/(5-1) = 0.25   New Draft: 0.125   Diff: -0.125
  Sentence in the paragraph (Num):
    Old Draft: 2   New Draft: 2   Diff: 0
  Paragraph in the essay (Ratio)

Textual
  Named entity:
    # of PERSON?    Old Draft: 2   New Draft: 1   Diff: -1
    # of LOCATION?  Old Draft: 0   New Draft: 0
  Discourse markers:
    Contains "because", "due to"?  Old Draft: No   New Draft: No
  Sentence difference:
    Diff in commas: 0
    Diff in digits: 0
    Edit distance: 31
  Revision Operation: Modify

Language
  POS tags:
    # of adjectives:      Old Draft: 1       New Draft: 1       Diff: 0
    # of nouns:
    Ratio of adjectives:  Old Draft: 0.077   New Draft: 0.111   Diff: 0.034
    Ratio of nouns:
  Spelling mistakes:  Old Draft: 0   New Draft: 0   Diff: 0
  Grammar mistakes:   Old Draft: 0   New Draft: 0   Diff: 0

Figure 3: An example of features extracted for the aligned sentence
pair (2->2).

Baseline: unigram features. Similarly to Daxenberger and Gurevych
(2012), we choose the count of unigram features as a baseline. Two
types of unigrams are explored. The first includes unigrams extracted
from all the sentences in an aligned pair. The second includes unigrams
extracted from the differences of sentences in a pair.
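
The two unigram feature types can be sketched as follows. The tokenization (whitespace splitting after stripping sentence-final punctuation) and the use of a multiset difference for the "diff" unigrams are assumptions made for illustration; the function names are hypothetical. An empty string stands for the null side of an Add or Delete pair.

```python
from collections import Counter


def tokenize(sentence):
    # Simplified tokenizer: drop final punctuation, split on whitespace.
    return sentence.rstrip(".!?").split()


def unigrams_of_all(old_sentence, new_sentence):
    """Counts of every unigram in both sentences of an aligned pair."""
    return Counter(tokenize(old_sentence) + tokenize(new_sentence))


def unigrams_of_diff(old_sentence, new_sentence):
    """Counts of unigrams that occur on one side of the pair only."""
    old_counts = Counter(tokenize(old_sentence))
    new_counts = Counter(tokenize(new_sentence))
    # Counter subtraction keeps only positive counts.
    return (old_counts - new_counts) + (new_counts - old_counts)
```

On the sentence pair of Figure 3, the diff unigrams produced by this sketch are the same ten words listed in the figure.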

Location group. As Falakmasir et al. (2014) have shown, the positional
features are helpful for identifying thesis and conclusion statements.
Features used include the location of the sentence and the location of
the paragraph. 3
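
A small sketch of the sentence-location features follows, using the ratio (index - 1) / (count - 1) that appears in Figure 3. The function names, the dictionary layout and the value 0.0 for a one-sentence paragraph are assumptions; the null side of an Add or Delete pair is set to 0, as stated in footnote 3.

```python
def sentence_position_ratio(sentence_index, sentences_in_paragraph):
    """Relative position of a 1-based sentence index within its paragraph."""
    if sentences_in_paragraph <= 1:
        return 0.0  # assumption: a lone sentence sits at position 0
    return (sentence_index - 1) / (sentences_in_paragraph - 1)


def side_features(position):
    """position is (sentence_index, sentences_in_paragraph), or None for null."""
    if position is None:
        # Empty side of an Add/Delete pair: all values are set to 0.
        return {"first": 0, "last": 0, "num": 0, "ratio": 0.0}
    index, total = position
    return {
        "first": int(index == 1),
        "last": int(index == total),
        "num": index,
        "ratio": sentence_position_ratio(index, total),
    }


def location_features(old_position, new_position):
    """Sentence-location features for one aligned pair, including diffs."""
    old, new = side_features(old_position), side_features(new_position)
    return {
        "old": old,
        "new": new,
        "num_diff": new["num"] - old["num"],
        "ratio_diff": new["ratio"] - old["ratio"],
    }
```

For the pair in Figure 3 (sentence 2 of 5 in the old paragraph, sentence 2 of 9 in the new one), `location_features((2, 5), (2, 9))` gives ratios 0.25 and 0.125 and a ratio difference of -0.125. Paragraph-level features could be computed the same way.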

Textual group. A sentence containing a specific person’s name is more
likely to be an example for a claim; sentences containing “because” are
more likely to be sentences of reasoning; a sentence generated by
text-based revisions is possibly more different from the original
sentence compared to a sentence generated by surface revisions. These
intuitions are operationalized using several feature groups: Named
entity features 4 (also used in Bronner and Monz (2012)’s Wikipedia
revision classification task), Discourse marker features (used by
Burstein et al. (2003) for discourse structure identification),
Sentence difference features and Revision operation (similar features
are used by Daxenberger and Gurevych (2013)).

3 Since Add and Delete operations have only one sentence in the aligned
pair, the value of the empty sentence is set to 0.

4 Stanford parser (Klein and Manning, 2003) is used in named entity
recognition.

Language group. Different types of sentences can have different
distributions in POS tags (Daxenberger and Gurevych, 2013). The
difference in the number of spelling/grammar mistakes 5 is a possible
indicator as Conventions/Grammar/Spelling revisions probably decrease
the number of mistakes.

4.2 Experiments 

Experiment 1: Surface vs. text-based. As the corpus study in Section 3
shows that only text-based revisions predict writing improvement, our
first experiment is to check whether we can distinguish between the
surface and text-based categories. The classification is done on all
the non-identical aligned sentence pairs with Modify operations. 6 We
choose 10-fold (student) cross-validation for our experiment. Random
Forest of the Weka toolkit (Hall et al., 2009) is chosen as the
classifier. Considering the data imbalance problem, the training data
is sampled with a cost matrix decided according to the distribution of
categories in training data in each round. All features are used except
Revision operation (since only Modify revisions are in this
experiment).

5 The spelling/grammar mistakes are detected using the languagetool
toolkit (https://www.languagetool.org/).

6 Add and Delete pairs are removed from this task as only text-based
changes have Add and Delete operations.

  N = 366        Precision   Recall   F-score
  Majority       32.68       50.00    37.12
  Unigram        45.53       49.90    46.69
  All features   62.89*      58.19*   55.30*

Table 6: Experiment 1 on corpus C1 (Surface vs. Text-based): average
unweighted precision, recall, F-score from 10-fold cross-validation;
* indicates significantly better than majority and unigram.
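
One reading of 10-fold (student) cross-validation is that folds are formed over students rather than over individual revisions, so that no student contributes to both the training and the test side of a fold. The sketch below illustrates that reading only; the function name and the round-robin assignment of students to folds are assumptions, and the classifier itself is not shown.

```python
def student_folds(student_ids, k=10):
    """Yield (train_indices, test_indices) with all revisions of a student
    kept on the same side of each fold.

    student_ids[i] is the author of revision instance i.
    """
    students = sorted(set(student_ids))
    # Round-robin assignment of students to folds (an assumption).
    fold_of = {student: i % k for i, student in enumerate(students)}
    for fold in range(k):
        test = [i for i, s in enumerate(student_ids) if fold_of[s] == fold]
        train = [i for i, s in enumerate(student_ids) if fold_of[s] != fold]
        yield train, test
```

Splitting at the student level avoids evaluating on revisions whose author's other revisions were seen during training.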

Experiment 2: Binary classification for each revision purpose category.
In this experiment, we test whether the system could identify if
revisions of each specific category exist in the aligned sentence pair
or not. The same experimental setting for surface vs. text-based
classification is applied.

Experiment 3: Pipelined revision extraction and classification. In this
experiment, revision extraction and Experiment 1 are combined together
as a pipelined approach. 7 The output of sentence alignment is used as
the input of the classification task. The accuracy of sentence
alignment is 0.9177 on C1 and 0.9112 on C2. The predicted Add and
Delete revisions are directly classified as text-based changes.
Features are used as in Experiment 1.

4.3 Evaluation 

In the intrinsic evaluation, we compare different feature groups’
importance. Paired t-tests are utilized to compare whether there are
significant differences in performance. Performance is measured using
unweighted F-score. In the extrinsic evaluation, we repeat the corpus
study from Section 3 using the predicted counts of revision. If the
results in the intrinsic evaluation are solid, we expect that a similar
conclusion could be drawn with the results from either predicted or
manually annotated revisions.
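
Assuming that "unweighted" means macro-averaged, i.e. each class contributes equally regardless of its size, the measures can be sketched as below. This is an illustration under that assumption, not the evaluation code used for the reported results; the function name is hypothetical.

```python
def macro_scores(gold, predicted):
    """Unweighted (macro-averaged) precision, recall and F-score."""
    labels = sorted(set(gold) | set(predicted))
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = (2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n
```

Under this definition, a classifier that always predicts the majority class of a two-class problem obtains an unweighted recall of 0.5 on a single test set, because it recovers all instances of one class and none of the other.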

Intrinsic evaluation. Tables 6 and 7 present the results of the
classification between surface and text-based changes on corpora C1 and
C2. Results show that for both corpora, our learned models
significantly beat majority and unigram baselines for all of unweighted
precision, recall and F-score; the F-score for both corpora is
approximately 55.

7 We leave pipelined fine-grained classification to the future.

  N = 162        Precision   Recall   F-score
  Majority       31.57       40.00    33.89
  Unigram        50.91       50.40    51.79
  All features   56.11*      55.03*   54.49*

Table 7: Experiment 1 on corpus C2.

Tables 8 and 9 show the classification results for the fine-grained
categories. Our results are not significantly better than the unigram
baseline on Evidence of C1, C2 and Claim of C2. While the poor
performance on Evidence might be due to the skewed class distribution,
our model also performs better on Conventions where there are not many
instances. For the categories where our model performs significantly
better than the baselines, we see that the location features are the
best features to add to unigrams for the text-based changes
(significantly better than baselines except Evidence), while the
language and textual features are better for surface changes. We also
see that using all features does not always lead to better results,
probably due to overfitting. Replicating experiments in two corpora
also demonstrates that our schema and features can be applied across
essays with different topics (Dante vs. poverty) written in different
types of courses (advanced placement or not) with similar results.

For the intrinsic evaluation of our pipelined
approach (Experiment 3), as the revisions extracted
are not exactly the same as the revisions annotated
by humans, we only report the unweighted precision
and unweighted recall here: C1 (p: 40.25, r: 45.05)
and C2 (p: 48.08, r: 54.30). Paired t-tests show that
the results drop significantly compared to Tables 6
and 7 because of the errors made in revision
extraction, although they still outperform the
majority baseline.

Extrinsic evaluation According to Table 10, the
conclusions drawn from the predicted revisions and
annotated revisions are similar (Table 5). Text-based
changes are significantly correlated with writing
improvement, while surface changes are not. We can
also see that the coefficient of the predicted
text-based change correlation is close to the
coefficient of the manually annotated results.

N = 1261          Text-based                             Surface
Experiments       Claim    Warrant  General  Evidence    Org.     Word     Conventions
Majority          39.24    32.25    29.38    27.47       25.49    27.75    31.23
Unigram           65.64    63.24    69.21    60.40       49.23    62.07    56.05
All features      66.20    70.76*   72.65*   60.57       54.01*   73.79*   70.95*
Textual+unigram   71.54*   68.13*   70.76    59.73       52.62*   75.92*   71.98*
Language+unigram  67.76*   66.27*   69.23    59.81       49.21    65.01*   69.62*
Location+unigram  69.90*   67.78*   72.94*   59.14       49.25    62.40    66.85*

Table 8: Experiment 2 on corpus C1: average unweighted F-score from 10-fold cross-validation; * indicates
significantly better than majority and unigram baselines. Rebuttal is removed as it only occurred once.

N = 494           Text-based                             Surface
Experiments       Claim    Warrant  General  Evidence    Word     Conventions
Majority          24.89    32.05    28.21    27.02       13.00    32.67
Unigram           54.34    64.06    55.00    56.99       49.56    60.09
All features      50.22    67.50*   56.50    53.90       56.07*   77.78*
Textual+unigram   52.19    65.79    55.74    56.08       54.19*   76.08*
Language+unigram  50.54    68.24*   56.42    56.15       58.83*   78.92*
Location+unigram  53.20    66.45*   58.08*   52.57       51.55    75.39*

Table 9: Experiment 2 on corpus C2; Organization is removed as it only occurred once.

Predicted purposes            R       p
#All revisions (N = 1262)     0.516   <0.001
#Surface revisions            0.175   0.245
#Text-based revisions         0.553   <0.001

Pipeline predicted purposes   R       p
#All (predicted N = 1356)     0.509   <0.001
#Surface revisions            0.230   0.124
#Text-based revisions         0.542   <0.001

Table 10: Partial correlation between number of
predicted revisions and Draft 2 score on corpus C1.
(Upper: Experiment 1, Lower: Experiment 3)
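
A partial correlation of the kind reported in Table 10 can be sketched as below. The sketch uses the standard first-order formula built from three Pearson correlations. It assumes that the variable being controlled for is the Draft 1 score, which is not stated in this section, and the function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z.

    Here x would be the number of revisions per essay, y the Draft 2 score,
    and z the control variable (assumed to be the Draft 1 score).
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

When the control variable is uncorrelated with both other variables, the partial correlation reduces to the ordinary Pearson correlation between them.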

5 Conclusion and current directions 

This paper contributes to the study of revisions for
argumentative writing. A revision schema is defined
for revision categorization. Two corpora are
annotated based on the schema. The agreement study
demonstrates that the categories defined can be
reliably annotated by humans. Study of the annotated
corpus demonstrates the utility of the annotation for
revision analysis. For automatic revision
classification, our system can beat the unigram baseline in
the classification of higher level categories (surface
vs. text-based). However, the difficulty increases for
fine-grained category classification. Results show
that different feature groups are required for
different purpose classifications. Results of extrinsic
evaluations show that the automatically analyzed
revisions can be used for writer improvement prediction.

In the future, we plan to annotate revisions
from different student levels (college level,
graduate level, etc.) as our current annotations lack
full coverage of all revision purposes (e.g.,
“Rebuttal/Reservation” rarely occurs in our high school
corpora). We also plan to annotate data from other
educational genres (e.g., technical reports, science
papers, etc.) to see if the schema generalizes, and to
explore more category-specific features to improve
the fine-grained classification results. In the longer
term, we plan to apply our revision predictions in
a summarization or learning analytics system for
teachers or a tutoring system for students.